{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Solving the Frozen Lake Problem with Value Iteration\n",
    "\n",
    "In the previous chapter, we have learned about the frozen lake environment. The frozen\n",
    "lake environment is shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's recap the frozen lake environment a bit. In the frozen lake environment shown above,\n",
    "the following applies:\n",
    "    \n",
    "* S implies the starting state\n",
    "* F implies the frozen states\n",
    "* H implies the hold states\n",
    "* G implies the goal state\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. That is, while trying to reach the goal\n",
    "state G from the starting state S if the agent visits the hole state H then it will fall into the\n",
    "hole and die as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, the goal of the agent is to reach the state G starting from the state S without visiting the\n",
    "hole states H as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/6.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we achieve this goal? That is, how can we reach the state G from S without\n",
    "visiting H? We learned that the optimal policy tells the agent to perform correct action in\n",
    "each state. So, if we find the optimal policy then we can reach the state G from S visiting the state H. Okay, how to find the optimal policy? We can use the value iteration method\n",
    "we just learned to find the optimal policy.\n",
    "\n",
    "\n",
    "Remember that all our states (S to G) will be encoded from 0 to 16 and all the four actions -\n",
    "left, down, up, right will be encoded from 0 to 3 in the gym toolkit.\n",
    "So, in this section, we will learn how to find the optimal policy using the value iteration\n",
    "method so that the agent can reach the state G from S without visiting H."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the frozen lake environment using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, our agent is in the state S and it has to reach the state G without visiting\n",
    "the states H. So, let's learn how to compute the optimal policy using the value iteration\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's learn how to compute the optimal value function and then we will see how to\n",
    "extract the optimal policy from the computed optimal value function. \n",
    "\n",
    "\n",
    "## Computing optimal value function\n",
    "\n",
    "We will define a function called `value_iteration` where we compute the optimal value\n",
    "function iteratively by taking maximum over Q function. For\n",
    "better understanding, let's closely look at the every line of the function and then we look at\n",
    "the complete function at the end which gives us more clarity.\n",
    "\n",
    "\n",
    "\n",
    "Define `value_iteration` function which takes the environment as a parameter: \n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def value_iteration(env):\n",
    "\n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #set the threshold number for checking the convergence of the value function\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #we also set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #update the value table, that is, we learned that on every iteration, we use the updated value\n",
    "        #table (state values) from the previous iteration\n",
    "        updated_value_table = np.copy(value_table) \n",
    "             \n",
    "        #now, we compute the value function (state value) by taking the maximum of Q value.\n",
    "        \n",
    "        #thus, for each state, we compute the Q values of all the actions in the state and then\n",
    "        #we update the value of the state as the one which has maximum Q value as shown below:\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                                        \n",
    "            value_table[s] = max(Q_values) \n",
    "                        \n",
    "        #after computing the value table, that is, value of all the states, we check whether the\n",
    "        #difference between value table obtained in the current iteration and previous iteration is\n",
    "        #less than or equal to a threshold value if it is less then we break the loop and return the\n",
    "        #value table as our optimal value function as shown below:\n",
    "    \n",
    "        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\n",
    "             break\n",
    "    \n",
    "    return value_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, that we have computed the optimal value function by taking the maximum over Q\n",
    "values, let's see how to extract the optimal policy from the optimal value function. \n",
    "\n",
    "\n",
    "## Extracting optimal policy from the optimal value function\n",
    "\n",
    "In the previous step, we computed the optimal value function. Now, let see how to extract\n",
    "the optimal policy from the computed optimal value function.\n",
    "\n",
    "\n",
    "First, we define a function called `extract_policy` which takes the `value_table` as a\n",
    "parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\n",
    "    #be zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step. After computing the Q function, we can extract policy by selecting action which has\n",
    "    #maximum Q value. Since we are computing the Q function using the optimal value\n",
    "    #function, the policy extracted from the Q function will be the optimal policy. \n",
    "    \n",
    "    #As shown below, for each state, we compute the Q values for all the actions in the state and\n",
    "    #then we extract policy by selecting the action which has maximum Q value.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "That's it! Now, we will see how to extract the optimal policy in our frozen lake\n",
    "environment. \n",
    "\n",
    "## Putting it all together\n",
    "We learn that in the frozen lake environment our goal is to find the optimal policy which\n",
    "selects the correct action in each state so that we can reach the state G from the state\n",
    "A without visiting the hole states.\n",
    "\n",
    "First, we compute the optimal value function using our `value_iteration` function by\n",
    "passing our frozen lake environment as the parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_value_function = value_iteration(env=env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we extract the optimal policy from the optimal value function using our\n",
    "extract_policy function as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = extract_policy(optimal_value_function)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the obtained optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to\n",
    "perform the correct action in each state. \n",
    "\n",
    "Now, that we have learned what is value iteration and how to perform the value iteration\n",
    "method to compute the optimal policy in our frozen lake environment, in the next section\n",
    "we will learn about another interesting method called the policy iteration. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
